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Abstract. The Postnikov character formula is used to express large portions 
of a Dirichlet character sum in terms of quadratic exponential sums. The 
quadratic sums are then computed using an analytic algorithm previously de- 
rived by the author. This leads to a power-saving if the modulus is smooth 
enough. As an application, a fast, and potentially practical, method to com- 
pute Dirichlet L-functions with complexity exponent 1/3 for smooth enough 
moduli is derived. 

1. Introduction 

Fast algorithms for numerically evaluating L-functions at individual points have 
generated some interest recently. The asymptotic power-savings achieved by such 
algorithms have shed light on the basic nature of L-functions. Also, given the 
wider availability of more powerful computers, there is better hope to reach regimes 
where such algorithms become practical, even if they are quite involved. One of 
these algorithm has already been used to compute the Riemann zeta function in 
small neighborhoods of many large values, and to examine the zero distributions 
there. The algorithms can in some cases be combined with amortized-complexity 
(or average cost) methods, such as |H3j . to enable large-scale numerical studies at 
very large height and level, and in windows of considerable size, which is useful for 
testing random matrix theory predictions for L-functions. 

New algorithms for computing the Riemann zeta function £(s) at individual 
points were derived in |H1( IH2| . One algorithm is capable of numerically evaluating 
£(l/2+ii) for t > 1 with an error bounded by 0(t~ x ) using t 1 / 3 +°^( 1 ) bit operations. 
Another faster algorithm requires t 4 / 13 +°>( 1 ) bit operations and t 4 / 13 +°^( 1 ) bits of 
storage. There are interesting analogies between these algorithms, which rely on fast 
computation of exponential sums, and methods for proving subconvexity estimates 
for zeta, meaning bounds of the form ^(1/2) < 1/4 — 5', for some 6' > 0, where 
/x^(l/2) is the infimum of all the numbers r\ such that C(l/2 + it) = 0(\t\ v ). For 
example, both the t 1 / 3 +° A ( 1 ) algorithm and the Weyl-Hardy-Littlewood method 
(which yields the bound /^(l/2) < 1/6; see [H 5.3]) start out by subdividing the 
"main sum" of zeta in a similar way. In both cases, the main difficulty is reduced 
to understanding the quadratic exponential sums K~ J J2o<k<K & e 27Vtak+27rl @ k , 
where a, (3 6 [0, 1). But unlike the Weyl-Hardy-Littlewood method, which detects 
cancellation in the quadratic sum due to the linear argument a solely, the algorithm 
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in |H1| for computing quadratic sums, which will play an essential role in this 
paper, relies in a crucial way on controlling the length of the sum via the quadratic 
argument j3, through repeated applications of van der Corput methods, and in a 
process reminiscent of the method of exponent pairs of van der Corput and Phillips 
(see [H 5.20]). 

Van der Corput methods were also the inspiration behind the i 4 / 13 + A(!) algo- 
rithm, but that time one was led to cubic sums K~ j J2o<k<K k j e 27riak+27T ^ k + 27r " ; 
where the situation was much more complicated (even though the range of l hap- 
pened to be very restricted in application). The t 4 / 13 +°A( 1 ) algorithm relies, in 
addition, on a fast Fourier transform (FFT) precomputation, which resembles the 
precomputation in Schonhage's method [5] for computing zeta in ^ 3 / 8 +°a(i) j n light 
of the similarities mentioned so far between algorithms and subconvexity estimates 
for zeta, it might be of interest to try to understand the complexity exponent 4/13 
in the subconvexity context. 

In the GL 2 setting, Vishe has derived an algorithm in jVlj that permits the 
accurate computation of £(1/2 + it, /), where / is a fixed modular cusp form (so 
big-0 constants depend on /), holomorphic or Maass, of weight k for a congruence 
subgroup of SL 2 (Z), using 0(1 + i 7 / 8+c ) operations, where t > and e' is any 
fixed positive real. The method employed in |Vlj is described as geometric, and is 
closely related to subconvexity estimates for GL 2 L-functions. It ultimately uses 
a direct precomputation (not involving the FFT) to obtain its power-savings. The 
method relies on an integral representation of L(s, /), and has the useful feature 
that only the first few coefficients of the cusp form are needed for computing the 
corresponding L-function. Vishe has also derived an algorithm in |V2j for com- 
puting the central value L(l/2, j\ x x), where x is a character mod q and /i is a 
fixed cusp form of weight k for the full modular group, with complexity as good as 
^5/6+0(1) wn en q j s highly composite. His method is closely related to recent work 
of Venkatesh on subconvexity estimates for L(l/2, j\ x x), 

The goal of this article is to extend the investigations in |HlllH2j to include the so- 
called g-aspect. We derive a fast, and potentially practical, algorithm for computing 
L(s, x), where \ is a character mod q, with an error bounded by 0(q~ A (|s| + 1)~ A ) 
in complexity <j' 1 / 3+OA ( 1 )(|s| + l) 1 / 3 +°^( 1 ) when q is smooth enough, where smooth 
enough means that the radical of q is small enough compared with q, or q is power- 
full. The method applies uniformly in q and s. It performs least well when q is 
square- free (e.g. if x is a real primitive character), in which case it requires about 
q(\s\ + l) 1 / 3 +° A ( 1 ) time. The improvements obtained here represent the first power- 
saving in the g-aspect over previous algorithms (which consume q 1 / 2 +° A ( 1 ) time) 
for any class of Dirichlct L-functions. 

Part of the story of the algorithm is the general analogy between the t-aspect 
and the depth aspect (highly power- full moduli such as q = p a , a — > oo) in the 
theory of character sums and Dirichlet L-functions. The power- full structure of the 
modulus will play an important role in the algorithm. We note though that it is 
not necessary for the exponent a to be large in order for the algorithm to perform 
its best. The running time <7 1 /3+oa(i) (| s | _|_ ^i/3+o a (i) - g gt -u achieved if simply 
q = p a and 3 | a (for example, if q = p 3 ), and, more generally, if the exponents of 
the prime factors of q are divisible by 3. 



COMPUTING CHARACTER SUMS 



3 



Our algorithm is related to proofs of subconvexity estimates for L(l/2 + it, %), 
which further strengthens the apparent connection between algorithms and subcon- 
vexity estimates for L- functions. Indeed, one of the essential steps is a specialization 
of the Postnikov character formula, stated in lemma [3"T2l where we exploit the power- 
full structure of the modulus q. Postnikov's formula was employed by Barban, 
Linnik, and Tshudakov in |BLT| to study the same family of Dirichlet L-functions 
tackled here. They proved the estimate ~^2 n<N x( n ) VNq 1 / 6 (log q) 1 / 2 , where 
q = p n , p > 3 is any fixed prime, n > n , N < q 2 ^ 3 , and x mod q is a non-principal 
character, Q from which the estimate \L(l/2 + it,x)\ *C (\t\ + l)g 1//6 (logg r ) 3 / 2 was 
deduced. 

The connection between algorithms and subconvexity estimates is somewhat 
puzzling, especially since, a priori, there is no compelling reason for it to exist. 
Fast algorithms work because they are able to express the L-function using fewer 
"terms" (whose sizes matter on a logarithmic scale only) , whereas subconvexity esti- 
mates rely on detecting cancellation among terms, and involve certain critical steps 
that are too crude for computation, or for which there is no simple computational 
analogue, such as elementary applications of the Cauchy-Schwarz inequality. Also, 
despite the parallels between the t 1 / 3 +°^( 1 ) algorithm and both, the Weyl-Hardy- 
Littlcwood method and the method of exponent pairs, the t 1 ' s+ ° x ^ 1 ' algorithm still 
does not yield the bound /Ltf(l/2) < 1/6. This is because the algorithm does not 
guarantee any cancellation should occur in the quadratic sum as it repeatedly ap- 
plies van der Corput methods to it. The only way the algorithm could sense (or 
roughly distinguish) the size of the quadratic sum is via the total number of iter- 
ations it uses in the computation. But this number is only poly-log in the length 
of the sum, and, therefore, variations in it do not affect the power-savings. Fur- 
thermore, consider the main difficulty in improving the £ 4 /!3+o A (i) algorithm is not 
that certain terms are getting too large, but that a certain FFT precomputation, for 
which there is no clear analogue when bounding /if (1/2), becomes too expensive. 

We note that the algorithm for L(s,x), which is stated in Theorem 11.11 actu- 
ally relies on a character sum estimate, namely the Polya- Vinogradov inequality, 
to obtain an upper bound for | Xm>M x( n ) rl ~ s |; which reduces the computation of 
L{s, x) to computing a main sum J2n<M xi n ) n ~ s \ where M is chosen according to 
the desired precision; see (|4.5|) . However, our use of the Polya- Vinogradov inequal- 
ity is not essential because, as the discussion in 2] will show, it can be replaced by 
the trivial estimate | J2n<N xi n )\ < 1 easily. In general, the available character sum 
estimates help us to obtain a shorter main sum for L(l/2+it, x), but not enough to 
improve the algorithmic power-savings. There are examples of algorithms, though, 
that achieve their saving by directly using a character sum estimate to obtain a 
relatively short main sum when -ft(s) is large enough. For instance, the algorithm 
of Booker [B] , which can certify the output of Buchmann's conditional algorithm 
for computing the class number of a quadratic number field Q(Vd) in time |d| 1 / 4+e 
if the output of Buchmann's algorithm is correct (as expected) and in time |<i| 1 / 2+e 
otherwise, directly uses Burgess' theorem (see |GLj ) to obtain a bound on the trun- 
cation error in a smoothed approximate functional equation for L(l,x), which is 

4t was mentioned in BLT lemma 6] that the estimate ^2n< N x( n ) ^ V / iV<? 1/ ' 6 (log g) 1 / 2 would 
still hold for N > q 2 / 3 since it would be a consequence of the Polya- Vinogradov inequality. But 
this seems to miss an extra factor of (log q) 1 / 2 , which in turn impacts the bound for \L(l/2 + it, \) 
by an extra factor of (logg) 1 / 2 . Nevertheless, we stated the bounds here the same as in |BLT| . 



4 



G.A. HIARY 



shown to suffice considering the class number needs to be computed to within half 
an integer only. 

In the remainder of the introduction, we overview the structure of the paper and 
discuss our underlying computational model, followed by a formal statement for 
the algorithm for L(s, x)- 

In Sj2j we provide background material on some previous methods, which will help 
place the algorithmic improvements obtained here into context. We also provide a 
sketch of our method for computing the character sums 

(1-1) S X (K):= ]T X {k), 

0<k<K 

which is the main step towards the algorithm for L(s, x)- Notice that, while S X (K) 
and L(s,x) are directly related, via L(s, x) = s S x (x)x~ s ~ 1 dx, 5i(s) > 0, for 
example, it is not in general immediate (or necessary) that savings achieved in 
computing S X (K) must translate to savings in computing L(s,x)- In our case, 
though, the method for computing S X (K) generalizes naturally to L{s,x)- In Sj3j 
we state and prove several needed results, starting with the simpler case of S X (K). 
As an application, we prove in 2]the complexity of the algorithm for L(s, x)- Last, 
in Sj5j we remark on the general modulus case. 

We now specify our computational model. Real numbers are represented using 
a fixed point system in base-2. So, when we write that operations are performed 
using £>-bit arithmetic, it means that real numbers are represented using B bits 
to the left of the radix point (integer bits), and B bits to the right of the radix 
point (fractional bits), together with a sign bit (specifying whether the number is 
positive or negative). The position of the radix point is fixed. This system can 
accommodate integers easily by removing the fractional bits entirely, and requiring 
an extra bit (flag) to tell whether the number is an integer. An integer can be 
coerced into a real number by padding it with all zero fractional bits then flipping 
the integer flag, and vice versa. We can represent a real number x using a 6-bit 
fixed system if log |x|/ log 2 < B, otherwise there is an overflow problem. Notice 
that, if there is no overflow, then x can be represented with a round-off error of 
±2~ e . Similarly, to represent an integer m exactly, we need log |m|/ log 2 < B. Our 
theorems and lemmas will always request large enough values of B to guarantee no 
overflow ever occurs. So, when we write that it suffices to use £>-bit arithmetic in a 
certain algorithm, it implicitly means that log M. / log 2 < B, where M. is the largest 
number that ever occurs in the algorithm (regardless of the order of operations). 
The number Ai depends on s and q only. It will be tracked during the derivations, 
and will manifestly satisfy a bound of the form log Ai < Ao((X + 1) log(g(|s| + l))) 4 
for some absolute constant Aq, which is quite generous. 

Our basic real operations are addition, multiplication, division by a non zero 
number, the cosine and the sine, exponentiation, and taking the logarithm of a 
positive number. We take for granted that there are algorithms to perform each of 
these operations using < AjB Kj bit operations and < A 2 jB K2j bits of storage, where 
Aj, A 2 j, iij, and R,2j, j — 1,2,3,4, are absolute numbers. In addition, we assume 
that the final output of any basic operation is correct to within ±A42~ B+:F , where T 
is an absolute constant (though, typically, the precision is much better) . Let C be an 
upper bound on the number of basic real operations of a given algorithm. Then, an 
upper bound for the round-off error accumulated during a full run of the algorithm is 
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CA42 B+Jr . So, if the final output is requested with an error tolerance e € (0, e 1 ), 
then it suffices to ensure eM2~ B+:F < e, or (logA^+logC+log(l/e))/log2+J r < B, 
which is how we determine upper bounds for the required number of bits in our 
algorithms. 

We make a similar assumption about the existence of algorithms to perform the 
basic integer operations: Addition, multiplication, and checking equality of two 
integers, such that operations can be performed exactly in the fixed point system, 
provided there is no overflow problem. (Division of integers is carried out using 
real numbers.) We assume that the elements of the ring Z/p a Z are modelled by (or 
identified with) the set of numbers {0, . . . ,p a — 1}, in the obvious way. The basic 
ring operations are addition, multiplication, and checking equality of two elements. 
Determining the multiplicative inverse, if it exists, of an element in Z/p a Z can be 
done via the Euclidean algorithm, which is fast. Ring operations are performed 
exactly using integer arithmetic in the fixed point system and by reducing modulo 
p a . 

Our algorithms will sometimes request to perform a precomputation, and to 
store the output in memory for later retrieval. We assume that any randomly- 
chosen precomputed value can be quickly retrieved from memory in roughly the 
same amount of time. This is usually a realistic assumption in core memory areas, 
and is one feature of the Random Access Machine model in complexity analysis, 
as mentioned in |LMO] for example. (This assumption is actually not essential for 
Theorem ll.il but this is not important.) For the sake of definiteness, the following 
suffices in our derivations. Suppose the precomputation results in T numbers say, 
where each number is represented using a B-bit fixed point system, then we assume 
that the cost of retrieving a precomputed number is uniformly bounded by < Ag (B+ 
logT) K!> bit operations, where Ag and kg are absolute constants. 

The computational complexity of the algorithms here is measured by the number 
of integer, real, and ring operations (or simply, operations) consumed. This in turn 
can be routinely bounded in terms of bit operations since all the numbers that 
occur in Theorem 11.11 can be expressed using a poly-log number of bits in q and 
\s\. We will specify what it means for a character x to be "given" as an input to 
algorithms at the beginning of fj3] When we write: u L(s,x) can be computed to 
within ±q- x (\s\ + iy x " it means, given s and x, we can find a number L(s, x) such 
that L(s,x) — L(s,x) = e luJ °CQ for some unknowns ujq € R and — g _A (|s| + 1)~ A < 
£o < q~ X (\s\ + iy x - So, in particular, \L(s,x) ~ L(s,x)\ < <T A (N + 1)~ A - With 
this in mind, we prove the following upper bound on the number of operations 
required for computing L(s,x). 



Theorem 1.1. There are absolute constants A\, . . . ,A§, k\, K2, and K3, such that 
for any real number X, any complex number s with 1/2 < Sfts < 1, any positive 
integer q = p^ 1 •••p^ fc (where pj are distinct primes), and any given character 
X mod q, the value of the Dirichlet L-function L(s, x) can be computed to within 
± q- x {\s\ + 1)- A using < A 1 p[ ai/3] ■ ■ -p^ h/ ^ {\s\ + l) 1 / 3 (A + l) Kl log" 1 {q(\s\ + 1)) 
operations on numbers of < A2 (A + l) 4 log 4 (g(|s| + 1)) bits, provided a precomputa- 
tion, that depends on q only, costing < A3 p\ ■ ■ ■ ph log K2 (q) operations on numbers 
of < A4 log(g) bits, and requiring < A§p\ ■ ■ ■ p^ \og K3 {q) bits of storage, is per- 
formed. 
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The precomputation in Theorem 11.11 comes directly from lemma 13.11 which fur- 
nishes a procedure for computing individual values of x to within ±e in poly-log 
time (in q and 1/e) using the precomputed values. Theorem 11.11 assumes that the 
factorization of q is given, but this is not essential since there are algorithms, with 
provable complexity, for factoring q at a cost subsumed by the overall cost of the 
algorithm (e.g. Lehman's method; see [CP] ) . The condition 1/2 < 5Rs < 1 in 
the theorem is not essential, and is inserted partly to simplify dealing with the 
L-functions associated with imprimitive characters; see the discussion preceding 
the proof of Theorem 11.11 in fQ] The upper bound A (log 4 (9(|s| + 1))) for the 
number of bits is generous, and is to simplify various proofs. It comes directly 
from Theorem 13. 4[ and all is required otherwise is 0\(log(q(\s\ + 1))) bit arith- 
metic. The algorithm can be modified easily so arithmetic is wholly performed 
using OA(log(g(|s| + 1))) bits, which is what one should do in a practical version. 
Also, in practice, one typically uses a floating point system for representing num- 
bers, not a fixed point system. This said, the asymptotic running times of our 
algorithms are independent of which system is used. The reason we prefer a fixed 
point system in the exposition is to make it fairly simple to determine the number 
of bits needed to perform basic operations. Last, in practice, it is worthwhile to 
minimize the use of expensive multi-precision arithmetic by carefully using Taylor 
expansions, precomputations, suitable orders of operations, and various tricks. 

Notation. We have been using asymptotic notation. For completeness, let us define 
it explicitly. Following [D, p. xiii], we write f2{x) = 0(fs(x)), or equivalently 
f2(x) <C fs(x), when there is an absolute constant C\ such that |/2(x)| < 61/3(2;) 
for all values of x under consideration. In this paper, the "values of x under 
consideration" is always a set of the form x > C2, where C2 is an absolute constant. 
We write f%{x) = o(f^(x)) when lim fa (x)/ fa (x) = 0, where the limit is always 
taken as x — > 00 in this paper. When we write 0^(.) or o^(l), it means the implied 
constants depend on A. So, for example, when we write 0^(.), it mean that C\ 
and C2 depend on A. If no dependence is indicated, then the implied constants 
are absolute (but for more emphasis, we will frequently state this explicitly). We 
will often refer to poly- log factor in some (positive) parameters x±, . . . ,x r >, which 
means a factor of the form A(\og(xi + 3) + • • ■ + log(av + 3)) K , where A and k are 
absolute constants. 



2. Background and the basic idea 

In |BF[ Theorem 1], Bombieri and Friedlander prove, for a fairly large class 
of L-functions, that it is not possible to approximate a fixed L-function L(s) = 
S^Li c n n ~ s (so big-O constants will depend on L), in a window T < 3(s) < 2T 
and just to the left of the critical line, with an error bounded by 0(T~ e °), eo > 0, 
using a single Dirichlet polynomial J2 n <x c n( x )n~ s , \c\ {x)\ > l/2,c n (x) <n°W, of 
length much shorter than the "analytic conductor" of L(s). Note that the analytic 
conductor terminology usually appears in connection with subconvexity estimates, 
but it also arises naturally in connection with algorithms for L-functions, such as 
those in |Rlj . Following |IK) . the analytic conductor of £(s) can be defined as 
\s\ + 3, and the analytic conductor of L(s, y) can be defined as q(\s + a\ + 3), where 
0:= (1— x{— 1))/2- The precise definition is not important to the asymptotic results 
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discussed here, and the reader may refer to jlKj for the definition for a general type 
of L-function. 

The result of Bombieri and Friedlander assures, for example, that L(a + it,x), 
a < 1/2, T < t < 2T, can not be approximated well by a single Dirichlet polynomial 
of length T 11 -^ 1 ). However, as remarked in |BFj . the behavior in a fixed strip to 
the right of the critical line is quite different: On the Lindelof hypothesis, L(s) 
can be approximated there with an error bounded by o(l) using arbitrarily short 
Dirichlet polynomials. Therefore, if we define /2l(ct), < a < 1 say, as the infimum 
of all the numbers rj such that L(a + it) can always be approximated in T < t < 2T 
with an error bounded by o(l) using a Dirichlet polynomial ^2 n<x c n (x)n ~ s of 
length x — 0(7^), then it is clear that /ii(cr) and /xl(ct) are qualitatively different. 
For example, /Ul(ct) is continuous, whereas is not (assuming the Lindelof 

hypothesis for L{s)). 

If the restriction on the number of Dirichlet polynomials is removed, then one 
can do better. For example, it is possible to approximate many L-functions with 
an error bounded by o(l) using two Dirichlet polynomials, each of length roughly 
the square-root of the analytic conductor. A general method for doing so is the 
smoothed approximate functional equation, which we will discuss shortly in more 
detail. But, first, we remark that in all known algorithms where the square-root 
barrier is broken, one had to ultimately abandon the framework of Dirichlet poly- 
nomials. For instance, the algorithms in |H2j rely on approximations via low-degree 
exponential sums, and so does the algorithm of Theorem ll.il 

While Theorem 11.11 never uses the smoothed approximate functional equation, 
it is still useful to discuss such a general method here since it supplies formulas for 
computing many L-functions, including L(s, \)- Formula (|2.1[) below is actually 
a specialization of a smoothed approximate functional equation formula in |R1) . 
valid for a Dirichlet series with arbitrary coefficients provided the series possesses a 
meromorphic continuation and a functional equation, and satisfies very mild growth 
conditions (so no Euler-product is required). See |R2j for a C++ implementation. 

Specifically, Rubinstein [Rlj provides two formulas for L(s, x), for the even and 
odd cases, which can be combined straightforwardly as follows: Assume \ mod q is 
a primitive character, and let a := (1 — x(— 1))/2, then 

(2.1) 

(f) 2 rfe%*,x)*-' = * B £x(»)G 

1 r(x) V"-TT r ( l - s + a nn2 \ 
+ S a+i pqi/2 2^ W u [ 2 ' S 2 q J ' 

n— 1 v 7 

where G is a smoothing function, expressed in terms of the incomplete Gamma 
function T(z, w), 

/oo 
e-^x 7 - 1 dx , 5J(w)>0, 

r(x) is the usual Gauss sum, 

(2-3) r( X ) :=]T X (n)e 2 ™^, 

n=l 



q 
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and 5 is a certain complex parameter, with a simple dependence on s, chosen 
to cancel out the exponential decay in T((s + a)/2) as |3(s)| — > oo; see [Rlj for 
details. Although the series in (|2.ip are infinite, the weights G(z,w) decay expo- 
nentially fast when $l(w) 1. For a given A, the series (|2.1I) can be truncated after 
g i/2+o A (i)(| s | + ^1/2+0^(1) terms w i t ]j a truncation error ±q" A (|s| + 1)~ A . Once 
truncated, the series can be evaluated term by term to give a numerical approxi- 
mation of L(s, x) accurate to within ±2<?~ A (|s| + 1)~ A say. Therefore, the number 
of terms needed is roughly equal to the square-root of the analytic conductor of 
L(s, x)- Notice that formula (|2.ip involves the evaluation of the Gauss sum t(x), 
which requires summing an additional q terms when done in a straightforward way. 
Also, a direct application of (|2.1j) requires computing roughly the first "square-root 
of the analytic conductor" Dirichlet coefficients. 

In the case of S X (K), where x is primitive, one can use the multiplicativity of 
X, together with a suitable choice of a smoothing function, to always (regardless 
of K) express S X (K) as a series involving q 1 / 2+ °><( 1 ) terms multiplied by t(x); see 
§E\ If is smaller than q 1 ^ 2 , however, then such a series does not lead to a faster 
computation since it is longer than the original sum. In Theorem 13.41 we provide 
a different method for computing S X (K), which leads to asymptotic speed-ups if 
q is smooth enough (and which is, in turn, the main ingredient in the proof of 
Theorem EE]) . 

We sketch Theorem 13.41 Let x mod q be any character, where q = p® 1 ■ ■ -p ^ 
(x need not be primitive). Theorem 13.41 assures that S X (K) can be computed to 
within ±e in about p[ ai ^ ■ ■ .p^ h ^ 3 ^ time, up to a poly-log factor in q and 1/e. 
This running-time improves on q 1 / 2 for many choices of the aj, or roughly when 



(2.4) 11 p^/ 3 l- Q ;/yi« Yl p a ^ 2 -^/*\ 

aje{l,2,4} a 3 -£{l,2,4} 

for some t\ > 0. Notice that [flj/3] — Oj/2 = 1/2 if aj — 1, and it vanishes if 
aj G {2,4}, so the l.h.s. is simply Y[ a =i \fPj-, which is the square-root of the 
square-free part of q. The behavior of the algorithm of Theorem 13.41 as q — > oo 
is very well-controlled, in the sense that power-savings are obtained regardless of 
whether q — > oo through some of the a^'s or some of the pj's or any combination 
thereof. For example, if q — p 3a , then the method requires about p a = q 1 ^ 3 time 
(even if p = 5 say). As another example, if q = pip 3 ," ', then the time requirement 
is about pip%, which represents a power-saving beyond q 1 ^ 2 when p^ 3> piq t2 for 
some €2 > 0. Roughly speaking, if the aj are large, or if the pj are large but with 
exponents nearly divisible by 3, then the running time is about g 1 / 3 +°( 1 ). 

Since our methods exploit the power-full structure of the modulus (via the Post- 
nikov character formula) , it is not surprising that aj = 1, which corresponds to the 
prime modulus case, appears as an exceptional case in (|2.4p , meaning it is a case 
where we do not improve on q 1 / 2 . But the appearance of aj — 2 and aj — 4 as ex- 
ceptional cases is somewhat surprising. The reason we do not obtain a power-saving 
beyond q 1 / 2 when aj — 4, for example, is because we encounter cubic exponential 
sums with possibly large cubic coefficients. There is no general algorithm to com- 
pute such sums faster than required by a straightforward evaluation except for the 
algorithm of |H2j . which is suitable for sums with small cubic coefficients. 
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We illustrate the basic idea of Thcorcm l3.4l in the situation q — p a and simplifying 
to a character sum. For further simplicity, assume K is a multiple of pl~ a / 3 l. So, 

(2.5) S X (K)= J2 X(0 E x(l + fcV a/31 ), 

0<Kp^'^ 0<k<K/pl"/31 
(i,p)=l 

where (to, n) denotes the greatest common divisor of m and n, and I is determined 
by the relation 11 = 1 mod p a . The particular choice of the exponent [a/3] in (|2.5p 
is so that the inner sum there can be expressed as a quadratic exponential sum via a 
specialization of the Postnikov character formula in lemma 15721 Once expressed this 
way, the inner sums can be computed to within ±e in poly- log time (in q and 1/e) 
using the algorithm in 111, Theorem 1.1]; see Theorem 13.31 here. If the exponent 
[a/3] in (|2.5[) is decreased any further, then, in general, the Postnikov character 
formula yields cubic and higher degree exponential sums. In Theorem 13. 4[ this idea 
is generalized to quadratic sums twisted by x- 

3. Computing S x (K) 

Let q — p^ 1 ■ ■ •pt h - (We assume the prime factorization of q is given to us.) Let 
X be a character mod q. We first discuss how x should be "given" as an input to 
the algorithms. To facilitate computation, the following way is convenient. 

Recall that every character x mod q can be expressed as a product of characters 
Xj modpj 3 , 1 < j < h. Assume, at first, that all the pj's are odd, or if pj = 2 for 
some j then a,j < 3. Then the theory of primitive roots applies, and we require 
X to be presented to the algorithm as an /i-tuple of roots of unity (cji, . . . , Uh) 
whose entries satisfy ai™ 3 = 1, where rrij := (f>(p^ 3 ) = p°j (pj — 1), together with 
an /i-tuplc of primitive roots (<?i mod p°j 3 , . . . ,gu mod p^ h ). Given such tuples, the 
algorithm define Xjidj) := w ji f° r 1 — 3 — which determines x uniquely. If pj = 2 
and cij > 3 for some j (so Xj is a character mod 2 a j ) , then the entries corresponding 
to pj in the above tuples are omitted, and we require an additional 2-tuple oj' 2 ) 
whose entries satisfy (uj[) 2 = 1 and (uj' 2 ) 2 = L The reason for this modification 
is that (Z/2 a Z)* is not cyclic if a > 3, and so there is no primitive root. Therefore, 
we rely on the well-known group decomposition of (Z/2 a Z)* to express the odd 
residue classes in the form (— l)"^"" 2 mod 2 a , where v\ and v 2 are integers that 
are uniquely determined modulo 2 and modulo 2 a_2 , respectively. Last, given the 
2-tuple (w^Wj), the algorithms define Xj{~ 1) := w i an( i Xj(5) '■= u' 2 . 

With x thus presented, we supply a fast procedure for computing x at individual 
points. Note that, in general, the problem of determining the value of x mod q at an 
individual point is a hard discrete log problem. There are known sub-exponential 
time algorithms for solving it but their running times are only conjectural: see [Olj 
for a survey of such algorithms. Fortunately, in our case, we can exploit the power- 
full structure of the modulus, which makes the problem computationally simple. 

Lemma 3.1. There are absolute constants K4, k§, and kq such that for any pos- 
itive integer q = p" 1 • • ■ p^' 1 (where pj are distinct primes), any given Dirichlet 
character ^ mod g, any positive e < e _1 ; and any integer < c < q, the value 
of x( c ) can be computed to within ±e using (9(log K4 (g/e)) operations on numbers 
of 0(log(g/e)) bits, provided a precomputation, that depends on q only, costing 
0(pi ■ ■ - ph log K5 (g)) operations, and requiring 0{p\ ■ ■ ■ ph log K6 (g)) bits of storage, 
is performed. Big-O constants are absolute. 
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Proof. It suffices to show how to compute each character Xj mod p^ 3 occurring in 
the decomposition x — Xi ' ' ' Xh- This is because there are only h <C logg such 
characters, and so the cost of computing x(c) is the same as Xj( c ) except for an 
additional multiplicative factor of log q, which falls within the target complexity of 
the lemma. In turn, to compute Xj{ c )> it suffices to solve the discrete log problem 
g x = c mod p^ 3 , because then Xj( c ) can be computed via the formula Xj( c ) — u ji 
which is fast since ujj is supplied to the algorithm via the presentation of x- So the 
difficult part of computing Xj( c ) i s to solve for x, which we do next. (If p divides 
c then Xj( c ) = 0; as this condition can be checked quickly by a single division, we 
may assume that gcd(pj, c) = 1 from now on.) 

Let us first deal with the odd pj case. Recall that <?j is the primitive root 
associated with Xj mod p^ 3 , and is supplied to algorithm via the presentation of 
X- To avoid notational clutter, let p = pj, a = aj, and g — gj. In order to solve 
g x = c mod p a , it suffices to find integers l\ and I2 such that = c p_1 mod p a 

and (g p )' 2 = c p mod p a . This is because, given l\ and I2, one can find integers 
r and s via the Euclidean algorithm (which is computationally fast) such that 
r(p — 1) + sp a ~ 1 = 1, and so x = r(p — l)Zi + sp"^ 1 ^ is a solution. Therefore, 
the discrete log problem mod p a can be reduced to two discrete log problems in 
the (cyclic) subgroups of (Z/p a Z)* of order p a ~ x and p — 1 (which are generated 
by g p ~ l and g p , respectively). Furthermore, the problem in the subgroup of 
order p a ~ x can be reduced to a — 1 discrete log problems in the subgroup of order 
p using a straightforward recursive procedure described by Pomerance [Pom| . For 
the convenience of the reader, let us sketch that procedure here. We may assume 
a > 2, otherwise the problem is either trivial or is already in the subgroup of order 
p. We want to solve (g p ~ 1 ) Ll = c p_1 mod p a . Since l\ can be expressed in the form 
h = &<) + •• ■ + b a -2P a ~ 2 , < b r < p, it suffices to determine the integers b r (which 
are the base-p digits of To this end, suppose bo, ... , 6 r _i are known, and let 

a r = ( ?-\gP-l)- b o—-br-lP r - 1 mod a 

(3.1) 

= g(.P~l)(b r p r + ---+b a -2p a ") m0( Jp a . 

Then, visibly, a r is (p — l)p r -power. So, letting /3 r = a p modp a , we deduce 
that j5 r is a (p — l)p a_2 -power, and thus /3 r is in the subgroup of order p, which is 
generated by g' = g( p ~ 1 ') p mod p a . Therefore, the solution of {g') x = (3 r mod p a , 
which is a discrete-log problem in the subgroup of order p, satisfies x = b r mod p, 
because, by definition, g( p_1 )p v = 1 modp a for any y £ Z, and so 

[3 r = a p "~ r ~ 2 = g ( p - l )^ pT +-+ b ^ pa ~ 2 ) pa ~ T ~' 2 mod p a 
(3.2) 2 

= (g(P-VP«- ) b ^(g') b r modp a , 

Moreover, x determines b r uniquely since, by hypothesis, < b r < p. As for bo, 
which is needed to initialize the procedure, it is found simply by solving (g') x = 
c (p—i)p mod p a , which is again a discrete log problem in the subgroup of order p 
(to which c^p -1 ^" modp a belongs). We note that quantities like a r , (3 r , g', and 
c (p-i)p mod p a can always be computed using repeated squaring mod p a , which 
is fast, otherwise the computation might take too long and the numbers involved 
might contain more bits than allowed by the lemma. 
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In summary, to find l\ it suffices to work with the generator g' of the subgroup 
of order p. One then tabulates its powers: g', . . . , (g') p mod p a , by sequentially 
multiplying by g' modulo p a . It is important that this precomputation does not 
depend on c, but only on q (it is even independent of x) 1 and so the table need 
not be created anew for each different c. The cost of creating the table is about p 
operations and p space (up to a poly-log factor in p a ) . Once done, the value of l\ can 
be determined from the precomputed values using a — 1 repetitions of procedure we 
have described, where each repetition involves a table look-up, which can be done 
in poly-log time in p a , and hence in q (assuming a random access memory model). 

As for li , it suffices to work with the generator g" = g pa 1 mod p a of the subgroup 
of order p — 1. As before, one tabulates its powers g", . . . , {g") p ~ 1 mod p a , so then 
l 2 can be determined by a direct table look-up. The overall cost of this is again 
about p time and p space. 

It remains to show how to compute individual values of \j mod 2 a , where a > 3. 
(The cases a G {1,2} do not represent any computational difficulty.) The main 
task here is to solve for v\ and v 2 such that (— l) Vl 5 V2 = c mod 2 a . The index v\ is 
simple to compute: It is either or 1 according to whether c is 1 or —1 modulo 4. 
As for V2 , it can be computed via a recursive procedure similar to the case of odd 
p (one works in the cyclic subgroup generated by 5, which has order 2 a ~ 2 ). Last, 



The next needed ingredient is the Postnikov character formula, which was derived 
by Postnikov to obtain upper bounds on character sums. It was later re-proved 
by Gallagher [Gaj . (See [I] and |IK| for other formulations.) The formula shows 
that the values of a Dirichlet character \ mod p a along the arithmetic progression 
{1, 1 + p b , 1 + 2p b , 1 + 3p b , . . .} are of the form x(l + P b x) = exp(27ri/(a;)), where 
f{x) is a polynomial with rational coefficients that depends on \i Pi an d b only. 
Lemma 13.21 below, which is a specialization of lemma 2 in [Gaj . shows that b can 
be arranged so that f(x) is of degree at most 2. 

Lemma 3.2. Let x mod p" be a Dirichlet character, and let b := [a/3]. If p is 

an odd prime, then there exists an integer L, depending on \> P, an d b only (so 
independent of x), such that 



for all x € Z. If p = 2 and a > 3, then there exists an integer L\, depending on \ 
and b only ( so independent of x ), such that 



for all x G Z. And if p = 2 and a < 3, then there exist absolute constants —1 < 
L 2 ,L 3 < 2 such that x(l + 2 b x) = (-1)(l 2 x+l 3 x 2 )/2 j Qr aU xeZ ^ 

Remark. We distinguish the conclusion of the lemma for odd p when a E {1,2,4}, 
where we have x(l +p b k) = e 2 " ifc /P a_b if a € {2,4}, and x(l +p b k) = 1 if a = 1, 
which is trivial. 

Proof. Let H be the kernel of the reduction homomorphism (Z/p a Z)* — > (Z/p b Z)*. 
So, H is a subgroup in (Z/p a Z)* consisting of the residue classes congruent to 



x,-(c) = Kr(c4) 



□ 



(3.3) 
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1 mod p b , and H has size \H\ = p a b . Using our model for Z/p a Z, the elements of 
H are identified with the set of integers {1 + p b x |0 < x < p a - b }. 

Assuming p > 2, we construct a character ip of H which generates the full 
character group of H, including the character x\ H , such that ip is given explicitly 
by a quadratic exponential. To this end, define the polynomial f{x) := 2x — x 2 . 
Then, for all x, y € Z, we have 

f{p b x + p b y + p 2b x y) = 2p b x + 2p b y + 2p 2b x y - p 2b x 2 - p 2b y 2 

(3.4) - 2p 2b x y - 2p 3b x 2 y - 2p 3b x y 2 - p ib x 2 y 2 

= fip h x)+f(p h y) modjA 

where we made use of the relation p 3b = mod p a , which holds due to our choice 
b = [a/3] . Consider the following function ip : H — » C defined by 

h . f 2mf(p b x)\ ( 4nix 2ttix 2 \ 

4,(1 + p b x) := exp J p [ P ' ) = exp - -^j . 

Notice this definition is independent of the model for Z/p a Z as it yields the same 
result if 1 +p b x is replaced by 1 +p b x +p a k for all k € Z. Now, by the congruence 
relation (|3.4|1 . we have 

(3.5) V((l + P b x) (1 + P b y)) = <A(1 + P b x) ^(1 + P b y) , 

and this equality holds for all x, y, G Z. Therefore, ip is multiplicative (^ respects 
the group operation in H). Moreover, ip is not identically zero, because ip(l) = 1. 
Hence, V must be a character of H. By a direct calculation, f(p b ) = mod p b , 
or, equivalently, f(p b )/p b is an integer. Moreover, since b > and p > 2, we have 
/(p b ) = 2p b — p 2b ^ modp b+1 , and so f(p b )/p b is relatively prime to p. This 
immediately implies that the values ip(l +p b ) u = e ^MJ(p b )/p b )/p"-\ 0< u < p a ~ b , 
are all distinct. In particular, ip has order p a ~ b , which is the same as the order of 
H. Therefore, the powers of ip span the full character group of H . 

But x\h i s a character of H. Hence, there is an integer L such that \\h = ^ '■ 
To find L, note that x(l +P b ) = exp(2mB / p a ~ b ) for some integer B depending on 
X, p, and b only. (In our application, B can be determined quickly using lemma [3~T] ) 
So, L can be computed by simply solving the congruence L f(p b )/p b = B mod p a ~ b , 
which yields 

(3.6) L = B2-p b mod p a " b . 

It remains to consider the case when the modulus is 2°. If a > 3, then the 
same derivation as in the odd prime case applies except one uses the polynomial 
fi(x) — x — x 2 /2 instead of f(x), which gives 

(3.7) L! = B1- 2 b -! mod 2 Q ~ b . 

(Notice fi(x) consists of the first two terms in the Taylor expansion of log(l + x)). 
If a < 3, then the previous proof does not go through because 6=1 and so the 
condition fi{p b ) ^ mod p b+1 fails. Nevertheless, if a < 3, then we can find integers 
L 2 and L 3 such that x(l + 2 b a;) = (-1)(l 2 x+l 3 x 2 )/2 _ Specifically, if x is the principal 
character, which is the sole character if a = 1, then take L2 = L% = 0. So, we may 
assume x is n °t principal. If a = 2, then there is a single non-principal character, 
for which we take L2 = 2, L3 = 0. And if a = 3, then take L2 = 1, £3 = —1 or 
L 2 = 2, L 3 = 0, or L 2 = 1, £3 = 1, according to whether x(3) = 1 and x(5) = — 1, 
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or x(3) = —1 and x(5) = 1, or x(3) = —1 and %(5) = —1, respectively. The validity 
of these choices of L2 and L3 can be verified by inspection. □ 

The last needed ingredient is the following algorithm for computing quadratic 
exponential sums (Theorem 1.1 in |H1| ). 



Theorem 3.3. There are absolute constants Aq, Aj, Ag, K7, and Kg such that for 
any positive e < e -1 , any integer K > 0, any integer j > 0, any a, ft G [0, 1), and 
with v :— i/(K,j, e) = (J + 1) log(K/e), the value of the function 

kJ 2^ k e 

0<k<K 

can be computed to within ± A 6 v K7 e using < Aj v K& arithmetic operations on num- 
bers of < A s v 2 bits. 



Since the algorithm of Theorem 13.31 will form an essential ingredient in Theo- 
rem 13.41 next, it is useful to detail it slightly further here for more completeness. 
The algorithm involves iterating the following maps (taken in order, and where, 
by an abuse of notation, occurances of K , a, ft, and v in separate maps do not 
necessarily denote the same quantity): Let 8^ denote kronecker's delta, and let 
< {x} := x — [x\ < 1 denote the fractional part of x, then using the periodicity 
of the complex exponential 

(K, a, ft, v) — >(K,a-[a\,ft- [ft\ , v) 

(3.8) (K, a, ft, v) — > (K, a, ft, v) 8 < 1/4 , + (K, a, ft -I, v) <5^> 3 /4 

+ {K, a + (1 - 2tf Q > 1/a )/2, ft - 1/2, v) <5 1/4 <^<3/4 , 

then conjugating the quadratic sum if necessary (v € {0, 1} tracks whether conju- 
gation is used) 

(3.9) (K, a, ft, v) — ► (K, -a - [-a\ ,-0,1- v) 8 0<o + (K, a, ft, v) Sp> , 

followed by an application of van der Corput methods (and using the self-similarity 
of the Gaussian) 

(3.10) (K,a,ft,v) — ► (La + 2ftK\,{a/(2ft)}, {-1/(4/3)}, v) , 
and, last, conjugating the sum back if conjugation was used in (13.91) 

(3.11) (K, a, ft, v) — > (K, -a, -ft, 1 - v) S v =i + (K, a, ft, v) 8 v =o , 

where we have assumed ft =^ 0, but actually the algorithm stops iterating the 
maps when —1/K < ft < 1/K, in which case it falls back on an Euler-Maclaurin 
summation method. The importance of the maps (|3.8[) . (|3.9|) . and (|3.11[) is they 
ensure we can always reduce to the case < a < 1 and < ft < 1/4. The 
transformation (|3 . 1 0[) requires a careful computation of a certain "remainder term" , 
which is shown to be doable in poly-log time, and which consumes the majority of 
the computational effort. In addition, each iteration yields a new linear combination 
of sums F(K,j',a',ft'), < j' < j, whose coefficients can be computed in poly- 
log time. The algorithm uses <C log if iterations because the length of the sum is 
guaranteed to be halved each time. More precisely, since < ft < 1/4, the length of 



2 In our application, we call a function W : R — > M a self-similar function if there exist constants 
a > and A such that W(x) = \W(ax). 
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the new sum satisfies [a + 2/3K\ < K/2 + 1, and therefore at most log Kj log 2 + 2 
iterations are needed. 

By combining lemma [3TTI lemma I3T21 and Theorem 13. 31 we obtain the following 
algorithm for computing theta sums twisted by a character \. This algorithm is 
how the power-savings in computing L(s, x) will be achieved in Theorem 1 1 . 1 1 later . 

Theorem 3.4. There are absolute constants Ag, . . . , A12, Kg, . . . , kh, such that for 
any any positive integer q — p" 1 • • ■ (where pj are distinct primes), any given 
character \ mod q, any positive e < e , any integer K > 0, any integer v, and 
any integer j > 0, and with v\ := vx(K, q, v, e) = (j + l)log(qK(\v\ + l)/e), the 
function 

(3.12) S x (K,v,j;a,l3):=^j £ W X (v + k) e 2 ™ ak+2 ^ k \ 

0<k<K 

can be computed to within ±e using < Agp[ ai ^^ ■ ■ -p^ h ^ 3 ^ ^ 9 operations on num- 
bers of < A10 v\ bits, provided a precomputation, that depends on q only, costing 
< A\\p\ ■ ■ ■ Ph log Kl0 ((?) operations, and requiring < A\ip\ ■ ■ ■ Ph log Kl1 (q) bits of 
storage, is performed. 

Remark. The precomputation requirement comes directly from lemma [XT] k 10 and 
Kn are the same as K5 and kq in lemma [3TT1 respectively. 



Proof. Since x nas period q, we have x( n ) = where n := n— [n/q\ . As h can 

be computed in poly-log time (in q and n), and as < n < q, then we only need to 
know how to compute x( n ) f° r < n < q. By lemma [3~T1 once a precomputation 
costing 0(pi ■ ■ - ph (\ogq) K5 ) operations and requiring 0(pi ■ ■ - ph (\ogq) K6 ) bits of 
storage is performed, the value of x( n ) f° r an y < n < q can be computed to within 
±e/(2K) using 0(log Ki (qK/e)) operations on numbers of 0(\og(qK/e)) bits using 
the precomputed values. Since such a precomputation is permitted by the theorem, 
we may assume from now on that x{ n ) can be computed to within ±e/(2K) for 
any < n < K + v in < v^ 1 + \og(\v\ + 1) < ^ 4+1 time. 

Let us first prove the lemma in the simpler situation v,j, a, ft = 0; i.e. for S X (K). 
To this end, define C := C q = p[ ai/3] ■ ■ .p[ a " /31 and K\ := K w = \{K - l)/C]. 
Then, split the range of summation in S X {K) into arithmetic progressions 

(3.13) S X (K)= J2 X(l) E xa + ick). 

0<1<C 0<k<K, 

Now, x = Xi- --Xh, where Xj modp" 3 . Sox(l+^Cfc) = xi(l+ICfc) • ■ • XhO- + lCk). 
Applying lemma l3T2l to each XjiX+ICk) separately, with x = Xj> a = a j> b — \ a j/3\ , 
and x — lCk/p b (note that x is an integer since, by definition, p b divides C), we 
can express each XjO- + ICk) as a quadratic exponential in k. Each application 
of lemma 13.21 involves two steps. First, one determines the integer < B < p a ~ b 
satisfying x(l + p b ) = exp(2iriB /p a ~ b ), which is straightforward since x(l + p b ) 
can be computed via lemma I3TT1 and using the already precomputed look-up tables. 
Second, one solves the congruence (I3.6[) or (|3.7p for L, which can be done fast via 
the Euclidean algorithm. Put together, the inner sum in p,13[) can be expressed in 
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the form 



(3.14) 



Q<k<K l 0<k<K t 

_ e 2iria 1 k+2TTil3 1 k 2 

0<k<Kt 



where «i,/3i S [0, 1) are constants, depending on x, C, and I only (so independent 
of A:), whose values can be determined quickly by solving at most h congruences 
like (|3.6|) and p.7j) . By Theorem 13. 3[ the exponential sum on the r.h.s. of (|3.14p 
can each be computed to within ±e/(2C) in poly- log time (in K and C/e). Since 
there are at most C such sums to be computed, the lemma follows for S X (K). 

We extend the previous method to the generalized sum S x (K,v,j;a, jS). To 
begin, define the following coefficients di^ r := di )r: j t c,K via the binomial expansion 

(3.15) K~ j (l + Cky =: ^ d^ r {k/Ki) r . 

r<j 

So di jr are explicitly given by 

(3.16) di, r = 



j\ p- r C r (Ki) r 



rj Ki 
Then, lemma I3~2l yields 
(3.17) 

S x (K,v,j;a,(3) 

= J2 X(l + v)e 2 " al+27rtf312 K-^l + Cky X (l+TT^Ck)e 2 ^ a+2l ^ ck+2 ^ c2k2 

0<1<C 0<k<K, 
(l+v,q)=l 

_ X(l + v ) e 27rial+27ri ^ 2 ^ dl ' r ^ k r e 2ni((a+2lf))C+ a2 )k+2Ki(pC 2 +fo)k 2 

0<KC 0<r<j 0<k<K, 

{l+v,q)=l 



where U2,fi2 G [0, 1) are constants, depending on x, C, and I + v only (so indepen- 
dent of fc), whose values can be computed quickly by solving < h congruences like 
(1331) and (32). 

We digress briefly to discuss how to compute d^ r - There are several ways for 
doing this; the following suffices for the current exposition. One precomputes the 
factorials r!, < r < j, exactly, by sequentially multiplying the integers 1, . . . ,r. 
Since r < j, this can be done using < j operations on integers of <C {j + 1) log(j + 
1) < (j + 1) 2 bitsB and requires <C (j + l) 3 bits of storage, which is allowed by the 
theorem. The binomial coefficient can then be computed to to within ±e/ (4C(j+l)) 
in the form j\/(r\{j —r)\) which requires three operations on numbers of <C (j + 1) 2 
bits using the precomputed values of the factorial. Also, each P~ r , C r , (K[) r , and 
K^, can be computed exactly using 0(1) operations on numbers of 0((j+l) log(qK) 
bits. In practice, however, one can pay attention to the order of operations when 



3 This step (and similar ones involving binomial coefficients) requiring (j + 1) log(j + 1) <C 
C? + I) 2 ^ v t bits, is essentially the reason why this theorem is stated with the upper bounds 
O(v^) on the number of bits (and Theorem 13.31 ( [Hll Theorem 1.1]) was stated with the upper 
bound 0(u 2 ) on the number of bits). Otherwise, all is required is -C j + \og(qK /e)-hit arithmetic 
(and < j + log(i<T/e)-bit arithmetic, respectively). It is plain that one can prove this is in fact all 
is required. 



1(5 



G.A. HIARY 



computing d^ r in order to minimize the number of required bits. Consequently, 
it useful to separately consider the possibilities Ki > j and Ki < j. If Ki > j, 
which is typically the case in applications, then by the facts (■ 7 ) < j^~ r , I < C, 
and K < CKi, we have d^ r < (j/Kiy~ r < 1. So di r are typically of bounded 
size. If K\ < j, then one can compute the sum directly at a cost of <§; KiV\ <C v\ 
operations, which falls within the target complexity. So let us assume that j < K. 

To conclude, define v := i>(K,j,e/q) = (j + l)\og(qK/e). Theorem 13.31 ensures 
that each quadratic sum in (|3.17|) (the inner-most sums in the last line) can be 
computed to within ±e/(4C(j + 1)) using 0(D Ks ) operations on numbers of 0(v 2 ) 
bits (since, by assumption, j < K). Since there are < C(j + 1) such sums, and 
since the precomputed look-up tables required by lemma [3~T1 are already available, so 
each x(^-t-w) can be computed to within ±e/(4C(j + 1)) using 0(v^ l+1 ) operations, 
then, on noting v < i>x, we see that the overall cost of computing S x (K,v, j; a, /3) 
to within ±e is 0(C(^i + ^ 4+1 + ^i s+1 )) operations. The theorem follows. □ 

4. Application: computing L(s, x) 

We would like the starting point in this section to be an unsmoothed approximate 
functional equation for L(s, x) (i-e. a "Riemann-Siegel" type formula). This is 
because unsmoothed formulae make it far simpler to apply subdivisions to the 
main sum, as we will do. On the downside, unsmoothed formulae of length square- 
root of the analytic conductor are quite complicated to derive. One might appeal 
to the main theorem in |Daj . for example, which provides an unsmoothed formula, 
but which does not apply when s is small, and the explicit asymptotic constants 
in its remainder term have not been worked out explicitly. Fortunately, given 
Theorem 13.41 we can circumvent these difficulties easily, at least for the purpose 
of the theoretical derivation. The reason is that, Theorem 13.41 will yield the same 
power-saving for computing L(s,x) even if one starts with a main sum of length 
[q d (|s| + 1) ], where d is any fixed number, because it will be applied locally, 
to blocks in the main sum, and it depends on the block-length and the required 
precision is a poly-log way only. 

To this end, let x mod q be a non-principal character, where q = p^ 1 ■ ■ -p^ 1 . We 
first consider the case when x is primitive. As before, let o := (1 — x(— 1))/2, so a 
is or 1 according to whether x is even or odd. Define 

(4-1) ^,x):=g) f r(^)x( S ,x), 

and £(,s, x) '■— x) — £( s >x)- We have the following functional equation 

ja 1/2 

Therefore, we may restrict our computations of L{s, x) to the half-plane 5R(s) > 
1/2 since values of L(s,x) elsewhere can be recovered routinely by the functional 
equation. Here, the Gauss sum t(x) can be computed to within ±e by Theorem l3.4[ 
which consumes about p[ ai ~^ ■ ■ ■ p^ h ^ time, up to a poly-log factor (in q and 1/e). 
While the functional equation is valid for primitive x only, it will be apparent that 
our use of it is not essential, provided we restrict the computations to the half 
plane 3?(s) > ctq, where <to > is fixed. Alternatively, one can use the expression 
for L(s,x) in terms of L(s, xi), where xi is the primitive character inducing x, 



COMPUTING CHARACTER SUMS 



17 



to enable the functional equation for L(s,xi) to be used instead. However, for 
simplicity, we will assume 1/2 < SR(s) < 1, say, from now on. 

We use the Polya- Vinogradov inequality to reduce the computation of L(s, \) to 
computing a main sum ^2 n<M x( n ) n ~ s i where M is chosen according to the desired 
precision. Specifically, if \ mod q is primitive then | ~^2 N <n<N x( n )\ < 'Z^logg, 
and if x mod q is induced by the primitive character xi mod qi, then on combining 
the estimates (see Chap. 23]) |E^<n<JV a x(n)| < 2"(<^)( gi ) 1/2 log( 9l ) and 
2«(«/9i) < d(q/qi) < 2{q/qi) 1 / 2 , where is the number of distinct prime factors 
of r, we obtain 

(4.3) | J2 X(n)\<2(q/qi) 1/2 (qi) 1/2 \0g(q 1 )<2q 1 ^\0gq. 

JVi<7l<jV 2 

Therefore, by applying partial summation to E n >M x( n ) n_S i we arrive at 
(4-4) L(s, X )= £ ^+K, 

l<n<M 

where 

("» «|»^')' 

For example, if M > (6q log g) 1 / 3 ^) , then the main sum in (|4.4p approximates 
L(l/2, x) to within ±g _1 ^ 2 . As another example, if q > 900 say, then we can ensure 
\TL\ < q~ x {\s\ + l)- x by taking M > q d (\s\ + l) d , d = (A + 1)/K(s). Notice that, if 
the desired bound on 7?. is reduced by a multiplicative factor of ei , then d changes 
to d— log(ei)/(3f(s) log(g(|s| + 1))), and so d grows very slowly (logarithmically) as 
we tighten the bound on 1Z. 

We conclude that to prove Theorem 11.11 it suffices to compute the main sum 
in (|4.4p with a suitable M. Before presenting the proof, we take a brief detour 
to emphasize the following. While the proof will yield the asymptotic complexity 
claimed in the theorem even when the length of the main sum is M > <7 100 (|s| + l) 100 
say, it is better in practice to start with a shorter main sum than that, which 
could be provided by a "Riemann-Siegel" type formula for L{s, x) with explicit 
asymptotic constants in the remainder term. On the other hand, if the available 
implementation of the algorithm of Theorem 13.41 is well-optimized, and supposing, 
for instance, that one wishes to compute L(l/2,%), or perhaps only the low-lying 
zeros of L(l/2 + it, x), to within ±q~ x / 2 (so that choosing M w (glogg) 2 suffices), 
then this issue might have a relatively little impact on the overall running time 
because , as mentioned before, the dependence of Theorem 13.41 on the length of 
the block and the desired precision is only poly-log. Therefore, in the proof of 
Theorem 1 1.1[ we prefer to retain the simplicity and uniformity provided by (|4.4|) . 
as well as its indifference to whether the character is primitive or not. 

Proof of Theorem Our goal is to prove an upper bound on the number of oper- 
ations required to compute L(s, x) to within ±<7 -A (|s| + l)~ A , where x is a character 
mod q, and q has prime factorization q = p" 1 ■■■pf l . The character x should be 
presented to the algorithm as we detailed in Sj3j Notice that the presentation of x 
includes the factorization of q. We assume, for convenience, that q(\s\ + 1) > 10 3 . 

In JO]), we choose M = \q d (\s\ + l) d ] where d = (A + 2)/9ft(s). This choice of 
M ensures, via estimate (|4.5|) . the assumption q(\s\ + 1) > 10 3 , and the hypothesis 
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1/2 < 3?(a) < 1 in the statement of the theorem, that \K\ < 0.1q _A (|s| + 1)~ A . 
Next, we use lemma 13. 1[ and the periodicity of to enable the evaluation of 
X(n) to within ±0.1g~ A (|s| + 1)~ A /M for any < n < M using 0((X + d + 
1) K4 log K4 (g(|s| + l))) operations on numbers of 0((A+rf+l) log(g(|s| + l))) bits. The 
lemma requires precomputing look-up tables, which is done only once throughout 
this proof. The precomputation costs 0(pi ■• -Ph log K5 (g)) operations, and requires 
0(pi ■ ■ ■ Ph log Ke (<z)) bits of storage, which is permitted by the theorem. 

Next, let Mi = Mi >q>s := p[ ai/3] ■ ■ -p[ ah/3] \(l + \s\) x / 3 ], and divide the main 
sum into an initial sum and a "bulk sum" 

(4.6) V X ^ - V X ^ I V X ^ . 

l<n<M l<n<Mi Mi<n<M 

By appealing to lemma I3TT1 to compute individual values of \, we see that the initial 
sum can be evaluated directly, to within ±0.1q _A (|s| + l)~ A , using 0(M\ (A + d + 
1) log(q(|s| + 1))) operations, which falls within our target complexity. So it remains 
to deal with the "bulk sum" , which is where the power-savings will be achieved. 
We perform the following dyadic subdivision Q of the "bulk sum" 

M ± <n<M l£Tn£l 

where I is the set of consecutive subintervals / that partition [Mi,M). Each 
subintcrval in / is of the form / = [N,2N), N E [M±,M), except possibly the 
last subinterval, which is of the form [N, M). In explicit terms, if we define do :— 
[log(M/Mi)/ log 2J, then X= {[2 r M u 2 r+1 M 1 ), < r < d Q } U {[2 d °Afi, M)}. Note 
that 

(4.9) m<^^l + l<10 d log(«(N + l)). 

Therefore, if one plans on computing each inner sum in (|4.8p separately, as we 
will do, then computing the full sum will multiply the cost by an extra factor of 
10dlog(q(|s| + 1)) only, which can be absorbed by our target complexity. Given 
this, it suffices to show how to compute each of the sums J2 n ei x( n ) n ~ 3 ■ 

For each subinterval / = [N, 2N) (except possibly the last one, which, in any 
case, is dealt with similarly), we define K := Kn_ s — \N/{\s\ + l) 1 ' 3 ]. We let 
V := Vn,k(I) = {N,...,N+ [N/K\K}, so V is a set of equidistant points in 



4 The following subdivision scheme is more efficient in practice than a dyadic subdivision (by a 
constant factor) because it yields larger blocks to feed into Theorem 13.41 later: Let vq = A^l> and 
sequentially define K r := min{ |"£v/(|s| + 1) 1 ^ 3 ],M — v r }, v r +i := v r + K r , to obtain 

(4.7) J2 x{n)n-"= £ £ X {v r + k)(v r + k)' s , 

M 1 <n<M 0<r<RO<k<K r 

where R := R s M\ M can be shown to satisfy R <C (\s\ + l) 1 / 3 . The reason we use a dyadic 
subdivision in the proof, even though it is less efficient, is because it is likely more familiar, and 
so it might be marginally simpler. 
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[N, 2N) separated by distance K. Therefore, we have: 
(4.10) 

\ " X{ n ) _ ST^ ,,fV,\ „-*logn 



nel N<n<2N 



E e ~ 



X(v + k)e~ s M«+*0 + V xW + k) e~ s lo s( v '+k) ^ 



E E ^ ^ 

vGV0<k<K Q<k<K' 



where the length of the second (tail) sum in ()4.10j) satisfies < K' < K. Now, by 
definition, N/K < ( | s | H- 1) 1 / 3 , and in particular |V| = [N/K\ + 1 < (|s| +1)V 3 + 1. 
So, to prove the theorem, it suffices to show that each inner sum in (|4.10[) (as 
well as the tail sum, which is handled similarly) can be computed in poly-log time. 
This will be accomplished via Theorem 13.41 as follows. We apply the expansion 
log(l +x)=x- x 2 /2 + x 3 /3 to log(l + k/v), to obtain 

V x(v + k) e- sl °z( v+ V = e - slogv V x(« + fc)e" slog(1+fe/,,) 

0<k<K 0<k<K 

(4-11) ~~ 2 3 

= e -slo B v J2 x(v + k)e- s ^-£* + £s— 



0<fc<-ff 

By our choice of K, and the facts v > N and N > M±, it follows that \k/v\ < 
(|s| + l) -1 / 3 , and so the cubic and higher terms in s log(l+k/v) = sk/v— s(k/v) 2 /2+ 
s(k/v) 3 /3~ ••• are bounded by 0(1). More precisely, \s\(k/v) 3+jl / (3 + ji) < 
(\s\ + l)~ j1 / 3 < (3/2)~ j1 / 3 . Thus, using Taylor expansions (in the third equality 
below), we obtain 

(4.12) 

e _ s( £_j^ + j^_...) = e -msi k+ miik 2 e -£^ife+|i|ife 2 --^fe 3 +... 



C 3 fc 3 _| _L sK J k J 



, ^0 K J + Es v k jQ 

_ iS (') ; : | ia (°) fc2 V ^ k 3 

— e v a« z j,s,v,J J7J + ^s,v,k,J + £-a,v,k,J ' 

0<j<J 

Since \s\(k/v) 3+h /(3 + ji) < (3/2)-^/ 3 , the first truncation error S' 3>vM sat- 
isfies |^ iV)fc)Jo | < 0.1 g- A (|s| + 1)- X /(MK) when J > J , where J < (d + 
A + 1) log(g(|s| + 1)). Thus, it suffices to choose Jo = [~Jo~|- Also, \£ s ,v,k,j\ < 

0. 1 q- x (\s\ + 1)- X /(MK) when J > J, where J < (d + A + 1) log(?(|«| + 1)). So it 
suffices to take J = \J~\ . To see why this bound on J holds, consider the function 
rj(w) := e TlW+ - +Tj o wJ \ where n := -$t(s)K/v, t 2 := $l(s)K 2 / (2v 2 ), and Tj := 
(-lp' +1 siP/(jV) for 3 < j < J , note t u t 2 < 1, and recall |r 3+J1 | < (3/2)-^/ 3 , 
then apply Cauchy's theorem | Zj iS:Vi j \ < (2Tr)^ 1 \f,:_ 5 , i r](w)/w J+1 d'w\ -C (5/4) _J . 
We conclude that both Jo and J can be taken of poly-log size. Moreover, the coeffi- 
cients Zj iS , Vi j (which are independent of k) can be computed fast as follows. Let Tj 
be defined as before, and define the polynomials P r {w) via the recursion: Pq(w) := 

1, P r (w) = (P^-iH + Pr-i{w)Q'{w))/r, where Q{w) := t\w H V t Jo w j °, and 

P'(w) and Q'(w) denote the derivative with respect to w. Then it is fairly easy to 
see that Zj tS>Vt j = Pj(0). And to compute Zj, StVt j , < j < J, it suffices to repeat 
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the said recursion J+l times, noting that each repetition requires <C (J + l)(Jo + l) 
operations only because it suffices to keep track of merely the first J+l terms in 
P r (w) throughout (we can discard the rest because P r (w) will be differentiated at 
most J times, then evaluated at zero). It follows that Zj^ s , v ^j , < j < J, can be 
computed at atotalcost < (J+1) 2 (J+1) < (d + A + l) 3 log 3 (g(|s| + l)) operations. 

By plugging (|4.12j) back into (|4. 1 1[) and interchanging the order of summation, we 
see that the last sum in (]4. 1 1[) can be rewritten, to within ±0.2q~ x (\s\ + 1)~ A /M, 
a linear combination, with quickly computable coefficients, of J + 1 sums of the 
form 

(4.13) — F X (« + fc) e 27rMfe+27r ^ fc2 , 

0<k<K 

where < j < J< (d+A+1) log g(|s| + l), a = -Q(s)/{2ttv), and (3 = Qf(s)/(47re 2 ). 
Letting := (J+l)(d+A+l) log(g(|«| + l)) < (A + l) 2 log 2 ( ? (|s| + l)), it follows by 
Theorem l3.4l that each sum (|4. 13[) can be computed to within ±0.1 g~ A (|s| + l)~ A /Af 
using <C p[ ai ^ 3 ^ ■ ■ •pj j a '*/ 3 l i/^ 9 operations on numbers of <C v\ bits. Since there 
are < |X| (|V| + 1) ( J + 1) <C (|s| + l) 1/,3 f| such sums to be computed, then, on 
accounting for all the truncation and finite precision errors introduced so far, we 
see that L(s,x) can be computed to within ±g _A (|,s| + 1)~ A using a further <C 
p r«i/31 . .. p K/3l (| s | + 1 )i/3 jy «o+2 operations . The theorem follows. □ 

In anticipation of a practical implementation of the algorithm, let us make a 
few more comments. In order to improve efficiency, one could assume a type of 
pseudo-randomness in the round-off errors that result, say, from summing a large 
number of terms (mainly, the sum over V). For example, one might model the 
round-off errors by an independent sequence of random variables with mean zero, 
which gives square-root cancellation in the aggregate error; see |Q2j . This way, 
the aggregate error is bounded in the Z 2 -norm (root-mean-square) rather than in 
the ^-norm. Such a model is suitable in large-scale computations that focus on 
statistics of zeros (e.g. |Q2| , |Goj . and |H3j ) because it is robust, it increases the 
efficiency noticeably, and the Z 1 -error obtained without assuming it may still suf- 
fice for many purposes (such as verifying the Riemann hypothesis at relatively low 
height, or computing moments or zero statistics - even at large height). However, 
when checking the Riemann hypothesis in neighborhoods of very close zeros, it 
might appear risky to rely on a model for the aggregate round-off error that in- 
volves square-root cancellation since the Riemann hypothesis is itself essentially 
about square-root cancellation, even though these cancellations occur for different 
reasons. Therefore, to deal with such instances, it is useful to have at least one 
implementation that controls the aggregate round-off error in the ^-norm while 
also minimizing the use of multi-precision arithmetic (so as to avoid unnecessary 
increases in the running time that occur when the built-in data types no longer 
suffice) . This can be done (well) with the aid of multi-precision packages (like MPFR 
and GMP). Ultimately, a reliable check for computational correctness is replication 
via independent means and implementations. 

5. Comments on the general modulus case 

While the method presented here for computing S X (K) does not yield a power- 
saving beyond q 1 ^ 2 when q G {p,p 2 ,p 4 }, and in fact requires time in the case 
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q = p, it is consistent with the existence of a general q 1 / 3 +° A ( 1 ) algorithm. In the 
^-aspect (i.e. with s = a + it and thinking of t large) there exists such an algorithm, 
as well as a faster one performing in t 4 / 13 +°- > >( 1 ) j which relies on computing cubic 
exponential sums; see H2 . But notice that the similarities found between the 
algorithms in the t and q aspects rest heavily on the power-full structure of the 
modulus, which suggests that significant additional algorithms are needed in the 
prime modulus case (and, likely, the square- free case). 

In the remainder of this section, we give a general method for computing S x (K) , 
where x mod q is any character, and with no assumption about the factorization 
of q. This method is primarily of interest in the range q 1 ^ 2 < K < q. (One can 
always reduce to the range K < q by the periodicity of x and the observation 
So<n<<? x( n ) — if x is nonprincipal) It might be illuminating, though, to first 
consider the following more general situation. Let cto, . . . , ckr-i be any sequence of 
numbers. If a r satisfy some favorable conditions, then the following procedure for 
computing ^2 r<L a r , where L < R, can be faster than a straightforward evaluation. 
To this end, let do, ■ ■ • , <Xr-i denote the dual sequence under the discrete Fourier 
transform, so 

(5.1) a m := ^2 "r-e . 

0<r<_R 

The domain of definition of a m can be extended to all of m G Z by setting a m := 
dm mod r, and we note that a m is already defined for any m£Z. Then, we have 
the following functional equation, valid for W in Schwartz class, say, 

oo oo 

(5-2) J2 <* m W{m) = - ]T 

m— — oo m— — oo 

where W(x) := J R W(y) e - 2mx y dy. 

To compute X)r<L a » > ' choose W := I * H, where / is the indicator function 
of [0, L], and H(y) := e~^ y2 / R . Notice that I(y) is very well-approximated by 
W(y), to within ±R~ X say, except for two intervals of length i? 1 / 2 +° A ( 1 ) near y = 
and y = L. So the difference between the l.h.s. of (|5.2[) and J2 r <L ar can ^ e 
computed to within LR~ X < R~ x+1 in R 1 /^ ^ 1 ) time. Also, W(y/R) decays, 
with y, like e~^ y < R , which implies that the r.h.s of (|5.2I) can be made of length 
i?i/2+o A (i) by truncating the series with an error <C R~ X - Letting R € ° denote 
the cost of computing an individual point ay, and letting R s ° denote the cost of 
computing an individual dual point otj, we see that ^2 r<L ct r (and more generally 
^ Lo<r<Lo+i a,.) can be computed, via (|5.2I) . in about R 1 /^ 5 " time, instead of 

In the case of a primitive Dirichlet character x mod q, we take R = q, so the 
dual is x( m ) — x( m ) r (x)0 If, in addition, x is rea b then we have a simple formula 
for the Gauss sum r(x), and so it is easy to see that x can b e computed provably 
quickly (in poly-log time) by appealing to quadratic reciprocity. We conclude that 
Er<iX( r ) can always be computed in g 1 / 2 +° A ( 1 ) time for real primitive \. In the 
case of a general character, we do not have quadratic reciprocity, but we can still 



If x IS not primitive, say it is induced by xi mod qi, then, for (m, q) = 1, we have x{m) = 
x( m )Xi (q/Qi )m( < ?/9i) t (xi ) where /i is the Mobius function. Notice that, if \ is real then the 
non-vanishing of \ tells whether q is square-free without explicitly factoring q. 
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express the Gauss sum as a ratio of series involving g 1 / 2 +° A ( 1 ) terms by applying 
formula (|5.2j) with a n — x( n ) an( i with W(x) a self-similar function with sufficient 
decay (e.g. e~^ x t q ). 

We also mention the following. Let p be an odd prime, let x be the real primitive 
character mod p, and let 

(5.3) g{x) : = — = 2^ e " ■ 

V P 0<k<p 

It is well-known that if \ 1S even and (m,p) — 1, then x( m ) — 9( m )- Similarly, if 
X is odd and (m,p) = 1, then x( m ) — ~*<?( TO )- Assuming x is even, say, we extend 
the domain of definition of x to all of x € K, setting xi x ) := 9( x )- By quadratic 
reciprocity, xi x ) can be computed, for integer x, in poly-log time. A consequence 
of the algorithm for computing quadratic exponential sums in Theorem 13.31 is that 
one can still compute x( x ) m P°ly~l°g time for any i£l 
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